Abstract. When studying convergence of measures, an important issue is
the choice of probability metric. We provide a summary and some
new results concerning bounds among some important probability met- 
rics/distances that are used by statisticians and probabilists. Knowledge 
of other metrics can provide a means of deriving bounds for another one 
in an applied problem. Considering other metrics can also provide alter- 
nate insights. We also give examples that show that rates of convergence 
can strongly depend on the metric chosen. Careful consideration is nec- 
essary when choosing a metric. 

Abrégé. The choice of probability metric is a very important decision
when studying the convergence of measures. We provide a summary of
several probability metrics/distances commonly used by statisticians and
probabilists, together with some new results concerning the bounds among
them. Knowledge of other metrics can provide a means of deriving bounds
for another metric in an applied problem. Taking several metrics into
consideration allows one to approach problems in a different way. We also
show that rates of convergence can depend strongly on the choice of
metric. It is therefore important to weigh everything when choosing a
metric.



1. Introduction 

Determining whether a sequence of probability measures converges is a
common task for a statistician or probabilist. In many applications it is
important also to quantify that convergence in terms of some probability
metric; hard numbers can then be interpreted by the metric's meaning,
and one can proceed to ask qualitative questions about the nature of that
convergence.

Abbreviation   Metric
D              Discrepancy
H              Hellinger distance
I              Relative entropy (or Kullback-Leibler divergence)
K              Kolmogorov (or Uniform) metric
L              Levy metric
P              Prokhorov metric
S              Separation distance
TV             Total variation distance
W              Wasserstein (or Kantorovich) metric
$\chi^2$       $\chi^2$-distance

Table 1. Abbreviations for metrics used in Figure 1.



There are a host of metrics available to quantify the distance between 
probability measures; some are not even metrics in the strict sense of the 
word, but are simply notions of "distance" that have proven useful to con- 
sider. How does one choose among all these metrics? Issues that can affect 
a metric's desirability include whether it has an interpretation applicable to 
the problem at hand, important theoretical properties, or useful bounding 
techniques. 

Moreover, even after a metric is chosen, it can still be useful to familiarize 
oneself with other metrics, especially if one also considers the relationships 
among them. One reason is that bounding techniques for one metric can be 
exploited to yield bounds for the desired metric. Alternatively, analysis of a 
problem using several different metrics can provide complementary insights. 

The purpose of this paper is to review some of the most important met- 
rics on probability measures and the relationships among them. This project 
arose out of the authors' frustrations in discovering that, while encyclopedic 
accounts of probability metrics are available (e.g., Rachev (1991)), rela- 
tionships among such metrics are mostly scattered through the literature 
or unavailable. Hence in this review we collect in one place descriptions 
of, and bounds between, ten important probability metrics. We focus on 
these metrics because they are either well-known, commonly used, or admit 
practical bounding techniques. We limit ourselves to metrics between
probability measures (simple metrics) rather than the broader context of
metrics between random variables (compound metrics).

We summarize these relationships in a handy reference diagram (Figure
1), and provide new bounds between several metrics. We also give examples
to illustrate that, depending on the choice of metric, rates of convergence 
can differ both quantitatively and qualitatively. 

This paper is organized as follows. Section 2 reviews properties of our
ten chosen metrics. Section 3 contains references or proofs of the bounds in
Figure 1. Some examples of their applications are described in Section 4. In
Section 5 we give examples to show that the choice of metric can strongly
affect both the rate and nature of the convergence.

Figure 1. Relationships among probability metrics. A directed arrow from
A to B annotated by a function $h(x)$ means that $d_A \le h(d_B)$. The
symbol diam denotes the diameter of the probability space $\Omega$; bounds
involving it are only useful if $\Omega$ is bounded. For $\Omega$ finite,
$d_{\min} = \inf_{x, y \in \Omega,\, x \ne y} d(x,y)$. The probability
metrics take arguments $\mu, \nu$; "$\nu$ dom $\mu$" indicates that the
given bound only holds if $\nu$ dominates $\mu$. Other notation and
restrictions on applicability are discussed in Section 3.



2. Ten metrics on probability measures 

Throughout this paper, let $\Omega$ denote a measurable space with
$\sigma$-algebra $\mathcal{B}$. Let $\mathcal{M}$ be the space of all
probability measures on $(\Omega, \mathcal{B})$. We consider convergence in
$\mathcal{M}$ under various notions of distance, whose definitions are
reviewed in this section. Some of these are not strictly metrics, but are
non-negative notions of "distance" between probability distributions on
$\Omega$ that have proven useful in practice. These distances are reviewed
in the order given by Table 1.

In what follows, let $\mu, \nu$ denote two probability measures on
$\Omega$. Let $f$ and $g$ denote their corresponding density functions with
respect to a $\sigma$-finite dominating measure $\lambda$ (for example,
$\lambda$ can be taken to be $(\mu + \nu)/2$). If $\Omega = \mathbb{R}$, let
$F, G$ denote their corresponding distribution functions. When needed,
$X, Y$ will denote random variables on $\Omega$ such that
$\mathcal{L}(X) = \mu$ and $\mathcal{L}(Y) = \nu$. If $\Omega$ is a metric
space, it will be understood to be a measurable space with the Borel
$\sigma$-algebra. If $\Omega$ is a bounded metric space with metric $d$, let
$\mathrm{diam}(\Omega) = \sup\{d(x,y) : x, y \in \Omega\}$ denote the
diameter of $\Omega$.



Discrepancy metric. 

(1) State space: $\Omega$ any metric space.

(2) Definition:

$$d_D(\mu, \nu) := \sup_{\text{all closed balls } B} |\mu(B) - \nu(B)|.$$

It assumes values in [0, 1].

(3) The discrepancy metric recognizes the metric topology of the
underlying space $\Omega$. However, the discrepancy is scale-invariant:
multiplying the metric of $\Omega$ by a positive constant does not affect
the discrepancy.

(4) The discrepancy metric admits Fourier bounds, which makes it useful 
to study convergence of random walks on groups (Diaconis 1988, 
p. 34). 
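
To make the definition and the scale-invariance remark concrete, the following Python sketch computes the discrepancy for finitely supported measures on the real line, where closed balls are closed intervals. The function name `discrepancy_on_line` and the representation of a measure as a dict from points to probabilities are assumptions made for this illustration; they do not come from the paper.

```python
from itertools import combinations_with_replacement


def discrepancy_on_line(mu, nu):
    """Discrepancy between finitely supported measures on the real line.

    mu, nu: dicts mapping support points (floats) to probabilities.
    Closed balls in R are closed intervals [a, b]. The mass of an interval
    depends only on which support points it contains, so it suffices to try
    intervals whose endpoints are support points (a == b is a ball of
    radius 0).
    """
    points = sorted(set(mu) | set(nu))
    best = 0.0
    for a, b in combinations_with_replacement(points, 2):
        mass_mu = sum(p for x, p in mu.items() if a <= x <= b)
        mass_nu = sum(p for x, p in nu.items() if a <= x <= b)
        best = max(best, abs(mass_mu - mass_nu))
    return best


mu = {0.0: 0.5, 1.0: 0.5}
nu = {0.0: 0.25, 1.0: 0.25, 2.0: 0.5}
d = discrepancy_on_line(mu, nu)

# Rescaling the metric (here: stretching the line by a factor of 10)
# leaves the discrepancy unchanged.
mu_scaled = {10 * x: p for x, p in mu.items()}
nu_scaled = {10 * x: p for x, p in nu.items()}
d_scaled = discrepancy_on_line(mu_scaled, nu_scaled)
```

The brute-force search over endpoint pairs is quadratic in the number of support points, which is adequate for small illustrations only.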



Hellinger distance. 

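(1) State space: $\Omega$ any measurable space.

(2) Definition: if $f$, $g$ are densities of the measures $\mu, \nu$ with
respect to a dominating measure $\lambda$,

$$d_H(\mu, \nu) := \left[ \int_\Omega (\sqrt{f} - \sqrt{g})^2 \, d\lambda \right]^{1/2}
= \left[ 2 \left( 1 - \int_\Omega \sqrt{fg} \, d\lambda \right) \right]^{1/2}.$$

This definition is independent of the choice of dominating measure
$\lambda$. For a countable state space $\Omega$,

$$d_H(\mu, \nu) := \left[ \sum_{\omega \in \Omega} \left( \sqrt{\mu(\omega)} - \sqrt{\nu(\omega)} \right)^2 \right]^{1/2}$$

(Diaconis and Zabell 1982).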

(3) It assumes values in $[0, \sqrt{2}]$. Some texts, e.g., LeCam (1986),
introduce a factor of a square root of two in the definition of the
Hellinger distance to normalize its range of possible values to [0, 1]. We
follow Zolotarev (1983). Other sources, e.g., Borovkov (1998), Diaconis
and Zabell (1982), define the Hellinger distance to be the square of
$d_H$. (While $d_H$ is a metric, $d_H^2$ is not.) An important property is
that for product measures $\mu = \mu_1 \times \mu_2$,
$\nu = \nu_1 \times \nu_2$ on a product space $\Omega_1 \times \Omega_2$,

$$1 - \tfrac{1}{2} d_H^2(\mu, \nu) = \left(1 - \tfrac{1}{2} d_H^2(\mu_1, \nu_1)\right) \left(1 - \tfrac{1}{2} d_H^2(\mu_2, \nu_2)\right)$$

(Zolotarev 1983, p. 279). Thus one can express the distance between
distributions of vectors with independent components in terms of the
component-wise distances. A consequence (Reiss 1989, p. 100) of the
above formula is
$d_H^2(\mu, \nu) \le d_H^2(\mu_1, \nu_1) + d_H^2(\mu_2, \nu_2)$.

(4) The quantity $(1 - \tfrac{1}{2} d_H^2)$ is called the Hellinger
affinity. Apparently Hellinger (1907) used a similar quantity in operator
theory, but Kakutani (1948, p. 216) appears responsible for popularizing
the Hellinger affinity and the form $d_H$ in his investigation of infinite
products of measures. Le Cam and Yang (1990) and Liese and Vajda
(1987) contain further historical references.
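
The product formula for the Hellinger affinity can be checked numerically on small discrete examples. The sketch below is only an illustration: the helper names (`hellinger`, `product_measure`, `affinity`) and the particular two-point and three-point distributions are assumptions chosen for the example.

```python
import math


def hellinger(mu, nu):
    """Hellinger distance d_H for measures on a countable space (dicts)."""
    keys = set(mu) | set(nu)
    return math.sqrt(sum(
        (math.sqrt(mu.get(k, 0.0)) - math.sqrt(nu.get(k, 0.0))) ** 2
        for k in keys))


def product_measure(m1, m2):
    """Product measure on pairs (a, b)."""
    return {(a, b): p * q for a, p in m1.items() for b, q in m2.items()}


def affinity(m, n):
    """Hellinger affinity 1 - d_H^2 / 2."""
    return 1.0 - 0.5 * hellinger(m, n) ** 2


mu1, nu1 = {"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}
mu2, nu2 = {0: 0.1, 1: 0.3, 2: 0.6}, {0: 0.3, 1: 0.3, 2: 0.4}
mu, nu = product_measure(mu1, mu2), product_measure(nu1, nu2)

lhs = affinity(mu, nu)                          # affinity of the products
rhs = affinity(mu1, nu1) * affinity(mu2, nu2)   # product of the affinities

# Subadditivity of the squared distance, a consequence of the formula.
h2_joint = hellinger(mu, nu) ** 2
h2_sum = hellinger(mu1, nu1) ** 2 + hellinger(mu2, nu2) ** 2

# Measures with disjoint supports attain the maximal value sqrt(2).
h_max = hellinger({"a": 1.0}, {"b": 1.0})
```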

Relative entropy (or Kullback-Leibler divergence).

(1) State space: $\Omega$ any measurable space.

(2) Definition: if $f$, $g$ are densities of the measures $\mu, \nu$ with
respect to a dominating measure $\lambda$,

$$d_I(\mu, \nu) := \int_{S(\mu)} f \log(f/g) \, d\lambda,$$

where $S(\mu)$ is the support of $\mu$ on $\Omega$. The definition is
independent of the choice of dominating measure $\lambda$. For $\Omega$ a
countable space,

$$d_I(\mu, \nu) := \sum_{\omega \in \Omega} \mu(\omega) \log \frac{\mu(\omega)}{\nu(\omega)}.$$

The usual convention, based on continuity arguments, is to take
$0 \log \frac{0}{q} = 0$ for all real $q$ and $p \log \frac{p}{0} = \infty$
for all real non-zero $p$. Hence the relative entropy assumes values in
$[0, \infty]$.

(3) Relative entropy is not a metric, since it is not symmetric and does
not satisfy the triangle inequality. However, it has many useful
properties, including additivity over marginals of product measures: if
$\mu = \mu_1 \times \mu_2$, $\nu = \nu_1 \times \nu_2$ on a product space
$\Omega_1 \times \Omega_2$,

$$d_I(\mu, \nu) = d_I(\mu_1, \nu_1) + d_I(\mu_2, \nu_2)$$

(Cover and Thomas 1991, Reiss 1989, p. 100).

(4) Relative entropy was first defined by Kullback and Leibler (1951) as 
a generalization of the entropy notion of Shannon (1948). A standard 
reference on its properties is Cover and Thomas (1991). 
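
The conventions for zero probabilities, the lack of symmetry, and additivity over product measures can all be seen in a few lines of Python. This is a minimal sketch for finite spaces; the function names and the example distributions are assumptions made for illustration.

```python
import math


def relative_entropy(mu, nu):
    """Relative entropy d_I(mu, nu) for measures on a countable space."""
    total = 0.0
    for k, p in mu.items():
        if p == 0.0:
            continue              # convention: 0 log(0/q) = 0
        q = nu.get(k, 0.0)
        if q == 0.0:
            return math.inf       # convention: p log(p/0) = infinity
        total += p * math.log(p / q)
    return total


def product_measure(m1, m2):
    """Product measure on pairs (a, b)."""
    return {(a, b): p * q for a, p in m1.items() for b, q in m2.items()}


mu1, nu1 = {"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}
mu2, nu2 = {0: 0.1, 1: 0.3, 2: 0.6}, {0: 0.3, 1: 0.3, 2: 0.4}

# Additivity over marginals of product measures.
joint = relative_entropy(product_measure(mu1, mu2), product_measure(nu1, nu2))
marginal_sum = relative_entropy(mu1, nu1) + relative_entropy(mu2, nu2)

# Asymmetry: swapping the arguments changes the value.
forward = relative_entropy(mu1, nu1)
backward = relative_entropy(nu1, mu1)

# Infinite value when mu charges a point that nu does not.
infinite = relative_entropy({"a": 1.0}, {"b": 1.0})
```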

Kolmogorov (or Uniform) metric. 

(1) State space: $\Omega = \mathbb{R}$.

(2) Definition:

$$d_K(F, G) := \sup_x |F(x) - G(x)|, \quad x \in \mathbb{R}.$$

(Since $\mu, \nu$ are measures on $\mathbb{R}$, it is customary to express
the Kolmogorov metric as a distance between their distribution functions
$F, G$.)

(3) It assumes values in [0, 1], and is invariant under all increasing one- 
to-one transformations of the line. 

(4) This metric, due to Kolmogorov (1933), is also called the uniform 
metric (Zolotarev 1983). 
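
For finitely supported measures the distribution functions are step functions, so the supremum is attained at a support point. The sketch below uses that observation and also illustrates invariance under an increasing one-to-one transformation (here the exponential map). The names `kolmogorov` and `push_forward` and the example data are assumptions introduced for the illustration.

```python
import math


def kolmogorov(mu, nu):
    """Kolmogorov distance between finitely supported measures on R.

    F - G is constant between consecutive support points, so it is enough
    to evaluate the two distribution functions at the support points.
    """
    points = sorted(set(mu) | set(nu))
    F = G = 0.0
    best = 0.0
    for x in points:
        F += mu.get(x, 0.0)
        G += nu.get(x, 0.0)
        best = max(best, abs(F - G))
    return best


def push_forward(m, phi):
    """Image of the measure m under a one-to-one map phi."""
    return {phi(x): p for x, p in m.items()}


mu = {0.0: 0.5, 1.0: 0.5}
nu = {0.0: 0.25, 1.0: 0.25, 2.0: 0.5}

d = kolmogorov(mu, nu)
d_transformed = kolmogorov(push_forward(mu, math.exp),
                           push_forward(nu, math.exp))
```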

Levy metric. 

(1) State space: $\Omega = \mathbb{R}$.

(2) Definition:

$$d_L(F, G) := \inf\{\epsilon > 0 : G(x - \epsilon) - \epsilon \le F(x) \le G(x + \epsilon) + \epsilon, \ \forall x \in \mathbb{R}\}.$$

(Since $\mu, \nu$ are measures on $\mathbb{R}$, it is customary to express
the Levy metric as a distance between their distribution functions $F, G$.)

(3) It assumes values in [0,1]. While not easy to compute, the Levy
metric does metrize weak convergence of measures on $\mathbb{R}$ (Lukacs
1975, p. 71). It is shift invariant, but not scale invariant.

(4) This metric was introduced by Levy (1925, p. 199-200). 

Prokhorov (or Levy-Prokhorov) metric. 

(1) State space: $\Omega$ any metric space.

(2) Definition:

$$d_P(\mu, \nu) := \inf\{\epsilon > 0 : \mu(B) \le \nu(B^\epsilon) + \epsilon \text{ for all Borel sets } B\}$$

where $B^\epsilon = \{x : \inf_{y \in B} d(x,y) \le \epsilon\}$. It assumes
values in [0, 1].

(3) It is possible to show that this metric is symmetric in $\mu, \nu$.
See Huber (1981, p. 27).

(4) This metric was defined by Prokhorov (1956) as the analogue of the
Levy metric for more general spaces. While not easy to compute, this
metric is theoretically important because it metrizes weak convergence
on any separable metric space (Huber 1981, p. 28). Moreover,
$d_P(\mu, \nu)$ is precisely the minimum distance "in probability" between
random variables distributed according to $\mu, \nu$. This was shown by
Strassen (1965) for complete separable metric spaces and extended
by Dudley (1968) to arbitrary separable metric spaces.

Separation distance. 

(1) State space: $\Omega$ a countable space.

(2) Definition:

$$d_S(\mu, \nu) := \max_i \left( 1 - \frac{\mu(i)}{\nu(i)} \right).$$

(3) It assumes values in [0, 1]. However, it is not a metric.

(4) The separation distance was advocated by Aldous and Diaconis 
(1987) to study Markov chains because it admits a useful charac- 
terization in terms of strong uniform times. 

Total variation distance. 

(1) State space: $\Omega$ any measurable space.

(2) Definition:

$$d_{TV}(\mu, \nu) := \sup_{A \subset \Omega} |\mu(A) - \nu(A)| \tag{1}$$

$$= \frac{1}{2} \max_{|h| \le 1} \left| \int h \, d\mu - \int h \, d\nu \right| \tag{2}$$

where $h : \Omega \to \mathbb{R}$ satisfies $|h(x)| \le 1$. This metric
assumes values in [0,1].

For a countable state space $\Omega$, the definition above becomes

$$d_{TV} := \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|,$$

which is half the $L^1$-norm between the two measures. Some authors
(for example, Tierney (1996)) define total variation distance as twice
this definition.

(3) Total variation distance has a coupling characterization:

$$d_{TV}(\mu, \nu) = \inf\{\Pr(X \ne Y) : \text{r.v. } X, Y \text{ s.t. } \mathcal{L}(X) = \mu, \ \mathcal{L}(Y) = \nu\}$$

(Lindvall 1992, p. 19).
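
On a small finite space the three descriptions of total variation (supremum over events, half the $L^1$-norm, and the coupling characterization) can be compared directly. The sketch below is an illustration under stated assumptions: the function names are invented for the example, the supremum is found by brute force over all events, and the coupling value uses the standard fact that a coupling placing mass $\min(\mu(x), \nu(x))$ on the diagonal exists and is optimal.

```python
from itertools import combinations


def tv_half_l1(mu, nu):
    """Half the L1 distance between two measures on a countable space."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(k, 0.0) - nu.get(k, 0.0)) for k in keys)


def tv_sup_over_events(mu, nu):
    """Brute-force supremum of |mu(A) - nu(A)| over all events A."""
    keys = sorted(set(mu) | set(nu))
    best = 0.0
    for r in range(len(keys) + 1):
        for event in combinations(keys, r):
            diff = sum(mu.get(k, 0.0) - nu.get(k, 0.0) for k in event)
            best = max(best, abs(diff))
    return best


def maximal_coupling_mismatch(mu, nu):
    """Pr(X != Y) for a coupling with mass min(mu, nu) on the diagonal."""
    keys = set(mu) | set(nu)
    return 1.0 - sum(min(mu.get(k, 0.0), nu.get(k, 0.0)) for k in keys)


mu = {0: 0.5, 1: 0.5}
nu = {0: 0.25, 1: 0.25, 2: 0.5}

a = tv_half_l1(mu, nu)
b = tv_sup_over_events(mu, nu)
c = maximal_coupling_mismatch(mu, nu)
```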
Wasserstein (or Kantorovich) metric.

(1) State space: $\mathbb{R}$ or any metric space.

(2) Definition: For $\Omega = \mathbb{R}$, if $F, G$ are the distribution
functions of $\mu, \nu$ respectively, the Kantorovich metric is defined by

$$d_W(\mu, \nu) := \int_{-\infty}^{\infty} |F(x) - G(x)| \, dx
= \int_0^1 |F^{-1}(t) - G^{-1}(t)| \, dt.$$

Here $F^{-1}, G^{-1}$ are the inverse functions of the distribution
functions $F, G$. For any separable metric space, this is equivalent to

$$d_W(\mu, \nu) := \sup \left\{ \left| \int h \, d\mu - \int h \, d\nu \right| : \|h\|_L \le 1 \right\}, \tag{3}$$

the supremum being taken over all $h$ satisfying the Lipschitz condition
$|h(x) - h(y)| \le d(x,y)$, where $d$ is the metric on $\Omega$.

(3) The Wasserstein metric assumes values in $[0, \mathrm{diam}(\Omega)]$,
where $\mathrm{diam}(\Omega)$ is the diameter of the metric space
$(\Omega, d)$. This metric metrizes weak convergence on spaces of bounded
diameter, as is evident from Theorem 2 below.

(4) By the Kantorovich-Rubinstein theorem, the Kantorovich metric is
equal to the Wasserstein metric:

$$d_W(\mu, \nu) = \inf_J \{ E_J[d(X, Y)] : \mathcal{L}(X) = \mu, \ \mathcal{L}(Y) = \nu \},$$

where the infimum is taken over all joint distributions $J$ with marginals
$\mu, \nu$. See Szulga (1982, Theorem 2). Dudley (1989, p. 342) traces
some of the history of these metrics.
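
For finitely supported measures on the real line, both integral forms of the Kantorovich metric reduce to finite sums. The following sketch evaluates each form separately so that their agreement can be checked; the function names, the tolerance parameter `tol`, and the example measures are assumptions made for this illustration.

```python
def wasserstein_cdf(mu, nu):
    """Integral of |F - G| for finitely supported measures on the line."""
    points = sorted(set(mu) | set(nu))
    F = G = 0.0
    total = 0.0
    for left, right in zip(points, points[1:]):
        F += mu.get(left, 0.0)
        G += nu.get(left, 0.0)
        # F - G is constant on the interval [left, right).
        total += abs(F - G) * (right - left)
    return total


def wasserstein_quantile(mu, nu, tol=1e-12):
    """Integral of |F^{-1} - G^{-1}| over (0, 1), pairing quantiles in order.

    Assumes every listed probability is strictly larger than tol.
    """
    a, b = sorted(mu.items()), sorted(nu.items())
    i = j = 0
    rem_a, rem_b = a[0][1], b[0][1]   # unmatched mass at the current atoms
    total = 0.0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        total += step * abs(a[i][0] - b[j][0])
        rem_a -= step
        rem_b -= step
        if rem_a <= tol:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0.0
        if rem_b <= tol:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0.0
    return total


mu = {0.0: 0.5, 1.0: 0.5}
nu = {0.0: 0.25, 1.0: 0.25, 2.0: 0.5}

w_cdf = wasserstein_cdf(mu, nu)
w_quantile = wasserstein_quantile(mu, nu)
```

The quantile pairing used in the second function is also a coupling of the two measures, so its value illustrates the coupling form in item (4) on the line.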

$\chi^2$-distance.

(1) State space: $\Omega$ any measurable space.

(2) Definition: if $f$, $g$ are densities of the measures $\mu, \nu$ with
respect to a dominating measure $\lambda$, and $S(\mu)$, $S(\nu)$ are their
supports on $\Omega$,

$$d_{\chi^2}(\mu, \nu) := \int_{S(\mu) \cup S(\nu)} \frac{(f - g)^2}{g} \, d\lambda.$$

This definition is independent of the choice of dominating measure
$\lambda$. This distance assumes values in $[0, \infty]$. For a countable
space $\Omega$ this reduces to:

$$d_{\chi^2}(\mu, \nu) := \sum_{\omega \in S(\mu) \cup S(\nu)} \frac{(\mu(\omega) - \nu(\omega))^2}{\nu(\omega)}.$$

This distance is not symmetric in $\mu$ and $\nu$; beware that the order
of the arguments varies from author to author. We follow Csiszar
(1967) and Liese and Vajda (1987) because of the remarks following
equation (4) below and the natural order of arguments suggested by
inequalities (10) and (11). Reiss (1989, p. 98) takes the opposite
convention as well as defining the $\chi^2$-distance as the square root of
the above expression.

(3) The $\chi^2$-distance is not symmetric, and therefore not a metric.
However, like the Hellinger distance and relative entropy, the
$\chi^2$-distance between product measures can be bounded in terms of the
distances between their marginals. See Reiss (1989, p. 100).

(4) The $\chi^2$-distance has origins in mathematical statistics dating
back to Pearson. See Liese and Vajda (1987, p. 51) for some history.



We remark that several distance notions in this section are instances of a
family of distances known as f-divergences (Csiszar 1967). For any convex
function $f$, one may define

$$d_f(\mu, \nu) = \sum_{\omega} \nu(\omega) f\left( \frac{\mu(\omega)}{\nu(\omega)} \right). \tag{4}$$

Then choosing $f(x) = (x - 1)^2$ yields $d_{\chi^2}$, $f(x) = x \log x$
yields $d_I$, $f(x) = |x - 1|/2$ yields $d_{TV}$, and
$f(x) = (\sqrt{x} - 1)^2$ yields $d_H^2$. The family of f-divergences is
studied in detail in Liese and Vajda (1987).
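
The four special cases can be checked with one generic routine. The following sketch is a minimal illustration for a finite space; it assumes that $\nu$ puts positive mass on every point (so that the ratio in equation (4) is defined), and the names `f_divergence` and `x_log_x` and the example measures are invented for the example.

```python
import math


def f_divergence(f, mu, nu):
    """Sum over omega of nu(omega) * f(mu(omega) / nu(omega)).

    Assumption for this sketch: nu(omega) > 0 for every listed point.
    """
    return sum(q * f(mu.get(k, 0.0) / q) for k, q in nu.items())


def x_log_x(x):
    """x log x, extended by continuity to x = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)


mu = {0: 0.5, 1: 0.5, 2: 0.0}
nu = {0: 0.25, 1: 0.25, 2: 0.5}

chi2 = f_divergence(lambda x: (x - 1) ** 2, mu, nu)             # chi^2
kl = f_divergence(x_log_x, mu, nu)                              # relative entropy
tv = f_divergence(lambda x: abs(x - 1) / 2, mu, nu)             # total variation
h2 = f_divergence(lambda x: (math.sqrt(x) - 1) ** 2, mu, nu)    # squared Hellinger

# Direct evaluation of the definitions from Section 2, for comparison.
chi2_direct = sum((mu[k] - nu[k]) ** 2 / nu[k] for k in nu)
kl_direct = sum(p * math.log(p / nu[k]) for k, p in mu.items() if p > 0)
tv_direct = 0.5 * sum(abs(mu[k] - nu[k]) for k in nu)
h2_direct = sum((math.sqrt(mu[k]) - math.sqrt(nu[k])) ** 2 for k in nu)
```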



3. Some Relationships Among Probability Metrics 

In this section we describe in detail the relationships illustrated in
Figure 1. We give references for relationships known to appear elsewhere,
and prove several new bounds which we state as theorems. In choosing the
order of presentation, we loosely follow the diagram from bottom to top.

At the end of this section, we summarize in Theorem 6 what is known
about how these metrics relate to weak convergence of measures.

The Kolmogorov and Levy metrics on $\mathbb{R}$. For probability measures
$\mu, \nu$ on $\mathbb{R}$ with distribution functions $F, G$,

$$d_L(F, G) \le d_K(F, G).$$

See Huber (1981, p. 34). Petrov (1995, p. 43) notes that if $G(x)$ (i.e.,
$\nu$) is absolutely continuous (with respect to Lebesgue measure), then

$$d_K(F, G) \le \left( 1 + \sup_x |G'(x)| \right) d_L(F, G).$$

The Discrepancy and Kolmogorov metrics on $\mathbb{R}$. It is evident that
for probability measures on $\mathbb{R}$,

$$d_K \le d_D \le 2 \, d_K. \tag{5}$$

This follows from the regularity of Borel sets in $\mathbb{R}$ and
expressing closed intervals in $\mathbb{R}$ as differences of rays.
The Prokhorov and Levy metrics on $\mathbb{R}$. For probability measures
on $\mathbb{R}$,

$$d_L \le d_P.$$

See Huber (1981, p. 34).

The Prokhorov and Discrepancy metrics. The following theorem shows
how discrepancy may be bounded by the Prokhorov metric by finding a
suitable right-continuous function $\phi$. For bounded $\Omega$,
$\phi(\epsilon)$ gives an upper bound on the additional $\nu$-measure of
the extended ball $B^\epsilon$ over the ball $B$, where
$B^\epsilon = \{x : \inf_{y \in B} d(x,y) \le \epsilon\}$. Note that this
theorem also gives an upper bound for $d_K$ through (5) above.

Theorem 1. Let $\Omega$ be any metric space, and let $\nu$ be any
probability measure satisfying

$$\nu(B^\epsilon) \le \nu(B) + \phi(\epsilon)$$

for all balls $B$ and complements of balls $B$ and some right-continuous
function $\phi$. Then for any other probability measure $\mu$, if
$d_P(\mu, \nu) = x$, then $d_D(\mu, \nu) \le x + \phi(x)$.

As an example, on the circle or line, if $\nu = U$ is the uniform
distribution, then $\phi(x) = 2x$ and hence

$$d_D(\mu, U) \le 3 \, d_P(\mu, U).$$

Proof. For $\mu, \nu$ as above,

$$\mu(B) - \nu(B^{\tilde{x}}) \ge \mu(B) - \nu(B) - \phi(\tilde{x}).$$

And if $d_P(\mu, \nu) = x$, then
$\mu(B) - \nu(B^{\tilde{x}}) \le \tilde{x}$ for all $\tilde{x} > x$ and all
Borel sets $B$. Combining with the above inequality, we see that

$$\mu(B) - \nu(B) - \phi(\tilde{x}) \le \tilde{x}.$$

By taking the supremum over $B$ which are balls or complements of balls,
we obtain

$$\sup_B (\mu(B) - \nu(B)) \le \tilde{x} + \phi(\tilde{x}).$$

The same result may be obtained for $\nu(B) - \mu(B)$ by noting that
$\nu(B) - \mu(B) = \mu(B^c) - \nu(B^c)$, which, after taking the supremum
over $B$ which are balls or complements of balls, gives

$$\sup_B (\nu(B) - \mu(B)) = \sup_{B^c} (\mu(B^c) - \nu(B^c)) \le \tilde{x} + \phi(\tilde{x})$$

as before. Since the supremum over balls and complements of balls will
be larger than the supremum over balls, if $d_P(\mu, \nu) = x$, then
$d_D(\mu, \nu) \le \tilde{x} + \phi(\tilde{x})$ for all $\tilde{x} > x$.
For right-continuous $\phi$, the theorem follows by taking the limit as
$\tilde{x}$ decreases to $x$. □
The Prokhorov and Wasserstein metrics. Huber (1981, p. 33) shows
that

$$(d_P)^2 \le d_W \le 2 \, d_P$$

for probability measures on a complete separable metric space whose metric
$d$ is bounded by 1. More generally, we show the following:

Theorem 2. The Wasserstein and Prokhorov metrics satisfy

$$(d_P)^2 \le d_W \le (\mathrm{diam}(\Omega) + 1) \, d_P$$

where $\mathrm{diam}(\Omega) = \sup\{d(x,y) : x, y \in \Omega\}$.

Proof. For any joint distribution $J$ on random variables $X, Y$,

$$E_J[d(X, Y)] \le \epsilon \cdot \Pr(d(X, Y) \le \epsilon) + \mathrm{diam}(\Omega) \cdot \Pr(d(X, Y) > \epsilon)
= \epsilon + (\mathrm{diam}(\Omega) - \epsilon) \cdot \Pr(d(X, Y) > \epsilon).$$

If $d_P(\mu, \nu) \le \epsilon$, we can choose a coupling so that
$\Pr(d(X, Y) > \epsilon)$ is bounded by $\epsilon$ (Huber 1981, p. 27). Thus

$$E_J[d(X, Y)] \le \epsilon + (\mathrm{diam}(\Omega) - \epsilon)\epsilon \le (\mathrm{diam}(\Omega) + 1)\epsilon.$$

Taking the infimum of both sides over all couplings, we obtain

$$d_W \le (\mathrm{diam}(\Omega) + 1) \, d_P.$$

To bound Prokhorov by Wasserstein, use Markov's inequality and choose
$\epsilon$ such that $d_W(\mu, \nu) = \epsilon^2$. Then

$$\Pr(d(X, Y) > \epsilon) \le \frac{1}{\epsilon} E_J[d(X, Y)] \le \epsilon$$

where $J$ is any joint distribution on $X, Y$. By Strassen's theorem (Huber
1981, Theorem 3.7), $\Pr(d(X, Y) > \epsilon) \le \epsilon$ is equivalent to
$\mu(B) \le \nu(B^\epsilon) + \epsilon$ for all Borel sets $B$, giving
$(d_P)^2 \le d_W$. □

No such upper bound on $d_W$ holds if $\Omega$ is not bounded. Dudley
(1989, p. 330) cites the following example on $\mathbb{R}$. Let $\delta_x$
denote the delta measure at $x$. The measures
$P_n := ((n - 1)\delta_0 + \delta_n)/n$ converge to $P := \delta_0$ under
the Prokhorov metric, but $d_W(P_n, P) = 1$ for all $n$. Thus the
Wasserstein metric metrizes weak convergence only on state spaces of
bounded diameter.
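
Dudley's example is easy to verify with exact rational arithmetic. Because the limit is a point mass at 0, the only coupling of $P_n$ with $\delta_0$ has $Y = 0$, so the Wasserstein distance is simply $E|X|$. The sketch below also computes the total variation distance, which tends to 0; since $d_P \le d_{TV}$ (Huber's bound quoted later in this section), this is consistent with convergence in the Prokhorov metric. The function names are assumptions introduced for the illustration.

```python
from fractions import Fraction


def p_n(n):
    """P_n = ((n - 1) * delta_0 + delta_n) / n, with exact fractions (n >= 1)."""
    return {0: Fraction(n - 1, n), n: Fraction(1, n)}


def wasserstein_to_delta0(m):
    """d_W(m, delta_0): the only coupling has Y = 0, so d_W = E|X|."""
    return sum(p * abs(x) for x, p in m.items())


def tv_to_delta0(m):
    """d_TV(m, delta_0): half the L1 distance, i.e. the mass m puts off 0."""
    return sum(p for x, p in m.items() if x != 0)


w_values = [wasserstein_to_delta0(p_n(n)) for n in range(1, 50)]
tv_values = [tv_to_delta0(p_n(n)) for n in range(1, 50)]
```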

The Wasserstein and Discrepancy metrics. The following bound can
be recovered using the bounds through total variation (and is therefore not
included on Figure 1), but we include this direct proof for completeness.

Theorem 3. If $\Omega$ is finite,

$$d_{\min} \cdot d_D \le d_W$$

where $d_{\min} = \min_{x \ne y} d(x,y)$ over distinct pairs of points in
$\Omega$.

Proof. In the equivalent form of the Wasserstein metric, Equation (3), take

$$h(x) = \begin{cases} d_{\min} & \text{for } x \text{ in } B \\ 0 & \text{otherwise} \end{cases}$$

for $B$ any closed ball. $h(x)$ satisfies the Lipschitz condition. Then

$$d_{\min} \cdot |\mu(B) - \nu(B)| = \left| \int_\Omega h \, d\mu - \int_\Omega h \, d\nu \right| \le d_W(\mu, \nu)$$

and taking $B$ to be the ball that maximizes $|\mu(B) - \nu(B)|$ gives the
result. □

On continuous spaces, it is possible for $d_W$ to converge to 0 while $d_D$
remains at 1. For example, take delta measures $\delta_\epsilon$ converging
on $\delta_0$.

The Total Variation and Discrepancy metrics. It is clear that

$$d_D \le d_{TV} \tag{6}$$

since total variation is the supremum over a larger class of sets than
discrepancy.

No expression of the reverse type can hold since the total variation
distance between a discrete and continuous distribution is 1 while the
discrepancy may be very small. Further examples are discussed in Section 5.

The Total Variation and Prokhorov metrics. Huber (1981, p. 34)
proves the following bound for probabilities on metric spaces:

$$d_P \le d_{TV}.$$

The Wasserstein and Total Variation metrics.

Theorem 4. The Wasserstein metric and the total variation distance satisfy
the following relation:

$$d_W \le \mathrm{diam}(\Omega) \cdot d_{TV}$$

where $\mathrm{diam}(\Omega) = \sup\{d(x,y) : x, y \in \Omega\}$. If
$\Omega$ is a finite set, there is a bound the other way. If
$d_{\min} = \min_{x \ne y} d(x,y)$ over distinct pairs of points in
$\Omega$, then

$$d_{\min} \cdot d_{TV} \le d_W. \tag{7}$$

Note that on an infinite set no such relation of the second type can occur
because $d_W$ may converge to 0 while $d_{TV}$ remains fixed at 1.
($\min_{x \ne y} d(x,y)$ could be 0 on an infinite set.)

Proof. The first inequality follows from the coupling characterizations of
Wasserstein and total variation by taking the infimum of the expected value
over all possible joint distributions of both sides of:

$$d(X, Y) \le 1_{X \ne Y} \cdot \mathrm{diam}(\Omega).$$

The reverse inequality follows similarly from:

$$d(X, Y) \ge 1_{X \ne Y} \cdot \min_{a \ne b} d(a, b).$$

□
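
Both inequalities of Theorem 4 can be observed on a finite set of points on the real line, where the Wasserstein distance has the closed form $\int |F - G| \, dx$ from Section 2. The example below uses unevenly spaced points so that $d_{\min}$ and the diameter differ substantially. The function names and the specific measures are assumptions chosen for this illustration.

```python
def tv(mu, nu):
    """Total variation distance as half the L1 distance."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(k, 0.0) - nu.get(k, 0.0)) for k in keys)


def wasserstein_on_line(mu, nu):
    """Integral of |F - G| for finitely supported measures on the line."""
    points = sorted(set(mu) | set(nu))
    F = G = 0.0
    total = 0.0
    for left, right in zip(points, points[1:]):
        F += mu.get(left, 0.0)
        G += nu.get(left, 0.0)
        total += abs(F - G) * (right - left)
    return total


# The state space is the finite set {0, 0.1, 5} with the usual distance.
mu = {0.0: 0.5, 0.1: 0.3, 5.0: 0.2}
nu = {0.0: 0.2, 0.1: 0.5, 5.0: 0.3}

points = sorted(set(mu) | set(nu))
d_min = min(b - a for a, b in zip(points, points[1:]))  # smallest gap
diam = points[-1] - points[0]                           # largest distance

d_tv = tv(mu, nu)
d_w = wasserstein_on_line(mu, nu)
lower, upper = d_min * d_tv, diam * d_tv
```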

The Hellinger and Total Variation metrics.

$$\frac{(d_H)^2}{2} \le d_{TV} \le d_H. \tag{8}$$

See LeCam (1969, p. 35).
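
As a numerical sanity check of (8), the following sketch evaluates both sides for every pair of two-point (Bernoulli) distributions on a grid of parameter values. The helper names, the grid, and the small floating-point tolerance are assumptions made for the illustration; a grid check of this kind illustrates the inequality but does not prove it.

```python
import math


def bernoulli(p):
    """Two-point distribution with mass p at 1 and 1 - p at 0."""
    return {0: 1.0 - p, 1: p}


def hellinger(mu, nu):
    """Hellinger distance d_H on a countable space."""
    keys = set(mu) | set(nu)
    return math.sqrt(sum(
        (math.sqrt(mu.get(k, 0.0)) - math.sqrt(nu.get(k, 0.0))) ** 2
        for k in keys))


def tv(mu, nu):
    """Total variation distance as half the L1 distance."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(k, 0.0) - nu.get(k, 0.0)) for k in keys)


grid = [i / 20 for i in range(21)]
tol = 1e-12  # slack for floating-point rounding

all_hold = True
for p in grid:
    for q in grid:
        h = hellinger(bernoulli(p), bernoulli(q))
        t = tv(bernoulli(p), bernoulli(q))
        if not (h * h / 2 <= t + tol and t <= h + tol):
            all_hold = False
```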

The Separation distance and Total Variation. It is easy to show (see,
e.g., Aldous and Diaconis (1987, p. 71)) that

$$d_{TV} \le d_S. \tag{9}$$

As Aldous and Diaconis note, there is no general reverse inequality, since
if $\mu$ is uniform on $\{1, 2, \ldots, n\}$ and $\nu$ is uniform on
$\{1, 2, \ldots, n - 1\}$ then $d_{TV}(\mu, \nu) = 1/n$ but
$d_S(\mu, \nu) = 1$.

Relative Entropy and Total Variation. For countable state spaces $\Omega$,

$$2 \, (d_{TV})^2 \le d_I.$$

This inequality is due to Kullback (1967). Some small refinements are
possible where the left side of the inequality is replaced with a
polynomial in $d_{TV}$ with more terms; see Mathai and Rathie (1975,
p. 110-112).

Relative Entropy and the Hellinger distance.

$$(d_H)^2 \le d_I.$$

See Reiss (1989, p. 99).

The $\chi^2$-distance and Hellinger distance.

$$d_H(\mu, \nu) \le \sqrt{2} \, (d_{\chi^2}(\mu, \nu))^{1/4}.$$

See Reiss (1989, p. 99), who also shows that if the measure $\mu$ is
dominated by $\nu$, then the above inequality can be strengthened:

$$d_H(\mu, \nu) \le (d_{\chi^2}(\mu, \nu))^{1/2}.$$

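The $\chi^2$-distance and Total Variation. For a countable state space
$\Omega$,

$$d_{TV}(\mu, \nu) = \frac{1}{2} \sum_{\omega \in \Omega} |\mu(\omega) - \nu(\omega)|
= \frac{1}{2} \sum_{\omega \in \Omega} \frac{|\mu(\omega) - \nu(\omega)|}{\sqrt{\nu(\omega)}} \sqrt{\nu(\omega)}
\le \frac{1}{2} \sqrt{d_{\chi^2}(\mu, \nu)},$$

where the inequality follows from Cauchy-Schwarz. On a continuous state
space, if $\mu$ is dominated by $\nu$ the same relationship holds; see
Reiss (1989, p. 99).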
The $\chi^2$-distance and Relative Entropy.

Theorem 5. The relative entropy $d_I$ and the $\chi^2$-distance
$d_{\chi^2}$ satisfy

$$d_I(\mu, \nu) \le \log[1 + d_{\chi^2}(\mu, \nu)]. \tag{10}$$

In particular, $d_I(\mu, \nu) \le d_{\chi^2}(\mu, \nu)$.

Proof. Since log is a concave function, Jensen's inequality yields

$$d_I(\mu, \nu) \le \log\left( \int_\Omega (f/g) \, f \, d\lambda \right) \le \log\left( 1 + d_{\chi^2}(\mu, \nu) \right) \le d_{\chi^2}(\mu, \nu),$$

where the second inequality is obtained by noting that

$$\int_\Omega \frac{(f - g)^2}{g} \, d\lambda = \int_\Omega \left( \frac{f^2}{g} - 2f + g \right) d\lambda = \int_\Omega \frac{f^2}{g} \, d\lambda - 1.$$

□

Diaconis and Saloff-Coste (1996, p. 710) derive the following alternate
upper bound for the relative entropy in terms of both the $\chi^2$ and
total variation distances.

$$d_I(\mu, \nu) \le d_{TV}(\mu, \nu) + \tfrac{1}{2} d_{\chi^2}(\mu, \nu). \tag{11}$$
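
The bounds relating relative entropy, total variation, the Hellinger distance and the $\chi^2$-distance can be checked together on a grid of two-point (Bernoulli) distributions. In the sketch below the second argument is kept strictly inside (0, 1) so that every quantity is finite; the helper names, the grid and the tolerance are assumptions made for the illustration, and the check illustrates the inequalities without proving them.

```python
import math


def distances(p, q):
    """Return (tv, h2, kl, chi2) between Bernoulli(p) and Bernoulli(q).

    Assumes 0 < q < 1 so that the relative entropy and chi^2 are finite.
    """
    mu, nu = (1.0 - p, p), (1.0 - q, q)
    tv = 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))
    h2 = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(mu, nu))
    kl = sum(a * math.log(a / b) for a, b in zip(mu, nu) if a > 0)
    chi2 = sum((a - b) ** 2 / b for a, b in zip(mu, nu))
    return tv, h2, kl, chi2


tol = 1e-12  # slack for floating-point rounding
all_hold = True
for i in range(21):
    for j in range(1, 20):
        tv, h2, kl, chi2 = distances(i / 20, j / 20)
        checks = [
            2 * tv ** 2 <= kl + tol,              # Kullback's bound
            h2 <= kl + tol,                       # (d_H)^2 <= d_I
            tv <= 0.5 * math.sqrt(chi2) + tol,    # Cauchy-Schwarz bound
            kl <= math.log1p(chi2) + tol,         # inequality (10)
            math.log1p(chi2) <= chi2 + tol,       # hence d_I <= chi^2
            kl <= tv + 0.5 * chi2 + tol,          # inequality (11)
        ]
        if not all(checks):
            all_hold = False
```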

Weak convergence. In addition to using Figure 1 to recall specific bounds,
our diagram there can also be used to discern relationships between
topologies on the space of measures. For instance, we can see from the
mutual arrows between the total variation and Hellinger metrics that they
generate equivalent topologies. Other mutual arrows on the diagram indicate
similar relationships, subject to the restrictions given on those bounds.

Moreover, since we know that the Prokhorov and Levy metrics both
metrize weak convergence, we can also tell which other metrics metrize weak
convergence on which spaces, which we summarize in the following theorem:

Theorem 6. For measures on $\mathbb{R}$, the Levy metric metrizes weak
convergence. Convergence under the discrepancy and Kolmogorov metrics
implies weak convergence (via the Levy metric). Furthermore, these metrics
metrize weak convergence $\mu_n \to \nu$ if the limiting measure $\nu$ is
absolutely continuous with respect to Lebesgue measure on $\mathbb{R}$.

For measures on a measurable space $\Omega$, the Prokhorov metric metrizes
weak convergence. Convergence under the Wasserstein metric implies weak
convergence.

Furthermore, if $\Omega$ is bounded, the Wasserstein metric metrizes weak
convergence (via the Prokhorov metric), and convergence under any of the
following metrics implies weak convergence: total variation, Hellinger,
separation, relative entropy, and the $\chi^2$-distance.

If $\Omega$ is both bounded and finite, the total variation and Hellinger
metrics both metrize weak convergence.

This follows from chasing the diagram in Figure 1, noting the existence
of mutual bounds of the Levy and Prokhorov metrics with other metrics
(using the results surveyed in this section) and reviewing conditions under
which they apply.

4. Some Applications of Metrics and Metric Relationships 

We describe some of the applications of these metrics in order to give 
the reader a feel for how they have been used, and describe how some au- 
thors have exploited metric relationships to obtain bounds for one metric 
via another. 

The notion of weak convergence of measures is an important concept in 
both statistics and probability. For instance, when considering a statistic T 
that is a functional of an empirical distribution F, the "robustness" of the 
statistic under small deviations of F corresponds to the continuity of T with 
respect to the weak topology on the space of measures. See Huber (1981). 
The Levy and Prokhorov metrics (and Wasserstein metric on a bounded 
state space) provide quantitative ways of metrizing this topology. 

However, other distances that do not metrize this topology can still be 
useful for other reasons. The total variation distance is one of the most com- 
monly used probability metrics, because it admits natural interpretations as 
well as useful bounding techniques. For instance, in (1), if A is any event,
then total variation can be interpreted as an upper bound on the difference
of probabilities that the event occurs under two measures. In Bayesian
statistics, the error in an expected loss function due to the approximation
of one measure by another is given (for bounded loss functions) by the total
variation distance through its representation in equation (2).

In extending theorems on the ergodic behavior of Markov chains on dis- 
crete state spaces to general measurable spaces, the total variation norm is 
used in a number of results (Orey 1971, Nummelin 1984). More recently, to- 
tal variation has found applications in bounding rates of convergence of ran- 
dom walks on groups (e.g., Diaconis (1988), Rosenthal (1995)) and Markov 
chain Monte Carlo algorithms (e.g., Tierney (1994), Gilks, Richardson and 
Spiegelhalter (1996)). Much of the success in obtaining rates of convergence 
in these settings is a result of the coupling characterization of the total 
variation distance, as well as Fourier bounds. 

Gibbs (2000) considers a Markov chain Monte Carlo algorithm which 
converges in total variation distance, but for which coupling bounds are 
difficult to apply since the state space is continuous and one must wait for 
random variables to couple exactly. The Wasserstein metric has a coupling 
characterization that depends on the distance between two random variables, 
so one may instead consider only the time required for the random variables 
to couple to within $\epsilon$, a fact exploited by Gibbs. For a related
example with a discrete state space, Gibbs uses the bound (7) to obtain
total variation
bounds. 
Like the total variation property (2), the Wasserstein metric also
represents the error in the expected value of a certain class of functions
due to the approximation of one measure by another, as in (3), which is of
interest in applications in Bayesian statistics. The fact that the
Wasserstein metric is a minimal distance of two random variables with fixed
distributions has also led to its use in the study of distributions with
fixed marginals (e.g., Rüschendorf, Schweizer and Taylor (1996)).

Because the separation distance has a characterization in terms of strong 
uniform times (like the coupling relationship for total variation), conver- 
gence of a Markov chain under the separation distance may be studied by 
constructing a strong uniform time for the chain and estimating the proba- 
bility in the tail of its distribution. See Aldous and Diaconis (1987) for such 
examples; they also exploit inequality (9) to obtain upper bounds on the
total variation.

Similarly, total variation lower bounds may be obtained via (6) and lower
bounds on the discrepancy metric. A version of this metric is popular among
number theorists to study uniform distribution of sequences (Kuipers and 
Niederreiter 1974); Diaconis (1988) suggested its use to study random walks 
on groups. Su (1998) uses the discrepancy metric to bound the convergence 
time of a random walk on the circle generated by a single irrational rotation. 
This walk converges weakly, but not in total variation distance because its 
n-th step probability distribution is finitely supported but its limiting mea- 
sure is continuous (in fact, uniform). While the Prokhorov metric metrizes 
weak convergence, it is not easy to bound. On the other hand, for this 
walk, discrepancy convergence implies weak convergence when the limiting 
measure is uniform; and Fourier techniques for discrepancy allow the calcu- 
lation of quantitative bounds. The discrepancy metric can be used similarly 
to study random walks on other homogeneous spaces, e.g., Su (2001). 

Other metrics are useful because of their special properties. For instance, 
the Hellinger distance is convenient when working with convergence of
product measures because it factors nicely in terms of the convergence of
the components. Reiss (1989) uses this fact and the relationships (8)
between the Hellinger and total variation distances to obtain total
variation bounds. The Hellinger distance is also used in the theory of
asymptotic efficiency (e.g., see LeCam (1986)) and minimum Hellinger
distance estimation (e.g., see Lindsay (1994)). It is used throughout
Ibragimov and Has'minskii (1981) to
quantify the rate of convergence of sequences of consistent estimators to their 
parameter. Kakutani (1948) gives a criterion (now known as the Kakutani 
alternative) using the Hellinger affinity to determine when infinite products 
of equivalent measures are equivalent; this has applications to stochastic 
processes and can be used to show the consistency of the likelihood-ratio 
test. See Jacod and Shiryaev (1987), Williams (1991) for applications. 

Diaconis and Saloff-Coste (1996) use log-Sobolev techniques to bound the
$\chi^2$ convergence of Markov chains to their limiting distributions,
noting that these also give total variation and entropy bounds. The
$\chi^2$-distance bears its name because in the discrete case it is the
well-known statistic used,
for example, in the classic goodness-of-fit test, e.g., see Borovkov (1998, 
p. 184). Similarly, the Kolmogorov metric between a distribution function 
and its empirical estimate is used as the test statistic in the Kolmogorov- 
Smirnov goodness-of-fit test, e.g., see Lehmann (1994, p. 336). 

Relative entropy is widely used because it is a quantity that arises natu- 
rally, especially in information theory (Cover and Thomas 1991). Statistical 
applications include proving central limit theorems (Linnik 1959, Barron 
1986) and evaluating the loss when using a maximum likelihood versus a 
Bayes density estimate (Hartigan 1998). In the testing of an empirical dis- 
tribution against an alternative, Borovkov (1998, p. 256) gives the relation- 
ship of the asymptotic behaviour of the Type II error to the relative entropy 
between the empirical and alternative distributions. In Bayesian statistics, 
Bernardo and Smith (1994, p. 75) suggest that relative entropy is the nat- 
ural measure for the lack of fit of an approximation of a distribution when 
preferences are described by a logarithmic score function. 

Up to a constant, the asymptotic behaviour of relative entropy, the Hellinger 
distance, and the $\chi^2$-distance is identical when the ratio of the density functions is 
near 1 (Borovkov 1998, p. 178). These three distances are used extensively 
in parametric families of distributions to quantify the distance between mea- 
sures from the same family indexed by different parameters. Borovkov (1998, 
pp. 180-181) shows how these distances are related to the Fisher information 
in the limit as the difference in the parameters goes to zero. 

5. Rates of Convergence that Depend on the Metric 

We now illustrate the ways in which the choice of metric can affect rates of 
convergence in one context: the convergence of a random walk to its limiting 
distribution. Such examples point to the need for practitioners to choose 
a metric carefully when measuring convergence, paying attention to that 
metric's qualitative and quantitative features. We give several examples of 
random walks whose qualitative convergence behavior depends strongly on 
the metric chosen, and suggest reasons for this phenomenon. 

As a first basic fact, it is possible for convergence to occur in one metric 
but not another. An elementary example is the convergence of a standardized 
Binomial $(n,p)$ random variable with distribution $\mu_n$, which converges 
to the standard normal distribution, $\nu$, as $n \to \infty$. For all 
$n < \infty$, $d_{TV}(\mu_n, \nu) = 1$, while $d_D(\mu_n, \nu) \to 0$ as 
$n \to \infty$. In the random walk context, Su (1998) shows that a random 
walk on the circle generated by an 
irrational rotation converges in discrepancy, but not total variation. The 
latter fact follows because the n-th step probability distribution is finitely 
supported, and remains total variation distance 1 away from its continuous 
limiting distribution. 
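
The Binomial example can be explored numerically. The sketch below is only an illustration and makes a substitution: it computes the Kolmogorov distance between the standardized Binomial and the standard normal rather than the discrepancy itself. On the real line the discrepancy over closed intervals is at most twice the Kolmogorov distance, so a small value here forces a small discrepancy. The total variation distance is not computed, since it equals 1 for every finite n (the Binomial law lives on finitely many points, which the normal law gives mass zero). The function names are hypothetical and the code assumes 0 < p < 1.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binomial_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p), computed via logs to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def kolmogorov_to_normal(n, p):
    """sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p)."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    cdf_left = 0.0          # F_n just to the left of the current atom
    worst = 0.0
    for k in range(n + 1):
        phi = normal_cdf((k - mean) / sd)
        cdf_right = cdf_left + binomial_pmf(n, p, k)
        # F_n is a step function, so the sup is attained at an atom,
        # either at its value or at its left limit.
        worst = max(worst, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return worst

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, kolmogorov_to_normal(n, 0.5))
```

Running the loop at the bottom shows the Kolmogorov distance shrinking as n grows, while the total variation distance stays fixed at 1.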

However, more interesting behavior can arise. Below, we cite a family 
of random walks on a product space, indexed by some parameter, which 
not only converges under each of total variation, relative entropy, and the 
$\chi^2$-distance, but exhibits different rates of convergence as a function of the 
parameter. 

We then cite another example of a family of walks that not only has 
different convergence rates under two different metrics, but also exhibits 
qualitatively different convergence behavior. This family exhibits a cutoff 
phenomenon under the first metric but only exponential decay under the 
second. 

Example: convergence rates that depend on the metric. The following 
family of random walks shows that convergence rates in total variation, 
relative entropy, and $\chi^2$-distance may differ. Recall that a family of random 
walks, indexed by some parameter $n$, is said to converge in $f(n)$ steps 
using some metric/distance if that metric/distance can be uniformly bounded 
from above after $f(n)$ steps and is uniformly bounded away from zero before 
$f(n)$ steps. 

Let $G = \mathbb{Z} \bmod g$, a finite group with $g$ elements. Then $G^n$ is the set 
of all $n$-tuples of elements from $G$. Consider the following continuous-time 
random walk on $G^n$: start at $(0, 0, \ldots, 0)$ and, according to a Poisson process 
running at rate 1, pick a coordinate uniformly at random and replace that 
coordinate by a uniformly chosen element from $G$. (Thus each coordinate is 
refreshed according to an independent Poisson process running at rate $1/n$.) 

Su (1995) proves that if $g$ grows with $n$ exponentially according to $g = 2^n$, 
then the random walk on $G^n$ described above converges in $2n \log n$ steps 
under the relative entropy distance, and log 2 steps under the $\chi^2$-distance. 
However, it converges in at most $n \log n$ steps under total variation. 

In this example, the relative entropy and $\chi^2$-distance are calculated by 
using properties of these metrics, while the total variation upper bound in 
this example is, in fact, calculated via its metric relationship (^) with the 
separation distance. 
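
The product structure of this walk makes all of these quantities easy to tabulate. The sketch below is not Su's (1995) derivation; it is a direct computation under stated assumptions. It uses the fact that at time t a given coordinate has not yet been refreshed with probability exp(-t/n) and is otherwise uniform on G. It takes relative entropy to be sum p log(p/q) and the chi-square distance to be sum (p-q)^2/q against the uniform measure (the normalization used elsewhere may differ), and separation to be max over x of 1 - P(x)/U(x). All function names are hypothetical.

```python
import math

def coordinate_law(n, g, t):
    """(p0, p1): mass at 0 and at each nonzero element, for one coordinate."""
    a = math.exp(-t / n)               # coordinate not yet refreshed
    return a + (1.0 - a) / g, (1.0 - a) / g

def relative_entropy(n, g, t):
    """Relative entropy to uniform; it adds over independent coordinates."""
    p0, p1 = coordinate_law(n, g, t)
    per_coord = p0 * math.log(g * p0)
    if p1 > 0:
        per_coord += (g - 1) * p1 * math.log(g * p1)
    return n * per_coord

def chi_square(n, g, t):
    """Chi-square distance to uniform; (1 + chi^2) multiplies over coordinates."""
    p0, p1 = coordinate_law(n, g, t)
    per_coord = g * ((p0 - 1.0 / g) ** 2 + (g - 1) * (p1 - 1.0 / g) ** 2)
    return (1.0 + per_coord) ** n - 1.0

def separation(n, g, t):
    """Separation distance; the least likely points have no zero coordinate."""
    _, p1 = coordinate_law(n, g, t)
    return 1.0 - (g * p1) ** n

def total_variation(n, g, t):
    """Total variation to uniform, grouping points by their number k of zeros."""
    p0, p1 = coordinate_law(n, g, t)
    u = float(g) ** (-n)
    total = 0.0
    for k in range(n + 1):
        count = math.comb(n, k) * (g - 1) ** (n - k)   # points with k zeros
        total += count * abs(p0 ** k * p1 ** (n - k) - u)
    return 0.5 * total

if __name__ == "__main__":
    n = 8
    g = 2 ** n
    for t in (n * math.log(n), 2 * n * math.log(n), n * n * math.log(2)):
        print(t, total_variation(n, g, t), separation(n, g, t),
              relative_entropy(n, g, t), chi_square(n, g, t))
```

Tabulating the four functions over a range of t for fixed n shows where each quantity becomes small, and the total variation values always sit below the separation values, which is the relationship used for the upper bound.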

Why does the difference in rates of convergence occur? This is due to the 
fact that total variation distance, unlike relative entropy or the $\chi^2$-distance, 
is insensitive to big or small elementwise differences when the total of those 
differences remains the same. For instance, consider the following measures 
$\mu$, $\nu$ on $\mathbb{Z}_{10}$. Let 

$$
\mu(i) = \begin{cases} 0.6 & \text{if } i = 0 \\ 0.1 & \text{if } i = 1,2,3,4 \\ 0 & \text{else} \end{cases}
\qquad
\nu(i) = \begin{cases} 0.2 & \text{if } i = 0,1,2,3,4 \\ 0 & \text{else.} \end{cases}
$$

Let $U$ be the uniform distribution on $\mathbb{Z}_{10}$. In this example 
$d_I(\mu, U) \approx 1.075$ and $d_I(\nu, U) \approx 0.693$, but the total variation 
distances are the same: $d_{TV}(\mu, U) = d_{TV}(\nu, U) = 0.5$. This is true 
because we could redistribute mass among the points on which $\mu$ exceeded 
$\nu$ without affecting the total variation distance. 
Thus it is possible for an unbalanced measure (with large elementwise 
differences from uniform) to have the same total variation distance as a 
more balanced measure. The random walk on the product space has a 
natural feature which rewards total variation distance but hinders relative 
entropy — the randomization in each coordinate drops the variation distance 
quickly, but balancing the measure might take longer. 
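
The numbers quoted for the two measures on the integers mod 10 can be checked directly. The short sketch below assumes relative entropy is sum p log(p/q) with natural logarithms, which reproduces the quoted values, and it also evaluates sum (p-q)^2/q as one common form of the chi-square distance; that normalization is an assumption. The function and variable names are illustrative.

```python
import math

def total_variation(p, q):
    """Half the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def relative_entropy(p, q):
    """sum p log(p/q), natural log, with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def chi_square(p, q):
    """sum (p - q)^2 / q; assumes q is strictly positive."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

mu = [0.6, 0.1, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]   # unbalanced
nu = [0.2] * 5 + [0.0] * 5                                 # more balanced
U = [0.1] * 10                                             # uniform

if __name__ == "__main__":
    for name, m in (("mu", mu), ("nu", nu)):
        print(name, total_variation(m, U), relative_entropy(m, U),
              chi_square(m, U))
```

Both measures sit at total variation distance 0.5 from uniform, while the relative entropy and the chi-square value are larger for the unbalanced measure.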

Example: qualitatively different convergence behavior. Chung, Diaconis 
and Graham (1987) study a family of walks that exhibits a cutoff 
phenomenon under total variation: the total variation distance stays near 1 
and then drops sharply to zero after a certain cutoff point. The 
family of random walks is on the integers mod $p$ where $p$ is an odd number, 
with a randomness multiplier: the process is given by $X_0 = 0$ and 
$X_n = 2X_{n-1} + \epsilon_n \pmod p$ where the $\epsilon_i$ are i.i.d. taking values $0, \pm 1$ each 
with probability 1/3. The stationary distribution for this process is uniform. 
Using Fourier analysis, Chung et al. (1987) show for the family of 
walks when $p = 2^t - 1$, for $t$ a positive integer, that $O(\log_2 p \log\log_2 p)$ 
steps are sufficient and necessary for total variation convergence. There is, in 
fact, a cutoff phenomenon. 

However, as proven in Su (1995, pp. 29-31), $O(\log p)$ steps are sufficient for 
convergence in discrepancy. Moreover, the convergence is qualitatively dif- 
ferent from total variation, because there is no cutoff in discrepancy. Rather, 
there is only exponential decay. 
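
For small p the two distances can be computed exactly by iterating the one-step transition, which gives a way to watch both curves side by side. The sketch below is an illustration, not the Fourier argument of the cited works. It assumes the walk starts at 0, and it takes the discrepancy to be the supremum over closed balls in the circular metric on the integers mod p, that is, over arcs of odd length; this is an assumed reading of the definition. The function names are hypothetical, and p is assumed odd.

```python
def step(dist, p):
    """One step of X -> 2X + e (mod p), with e uniform on {-1, 0, 1}."""
    new = [0.0] * p
    for x, mass in enumerate(dist):
        if mass:
            for e in (-1, 0, 1):
                new[(2 * x + e) % p] += mass / 3.0
    return new

def total_variation_to_uniform(dist):
    p = len(dist)
    return 0.5 * sum(abs(m - 1.0 / p) for m in dist)

def discrepancy_to_uniform(dist):
    """sup over arcs of odd length (closed balls) of |mu(arc) - U(arc)|."""
    p = len(dist)
    # prefix[k] = sum of (dist[x] - 1/p) over the first k points, two laps
    prefix = [0.0]
    for k in range(2 * p):
        prefix.append(prefix[-1] + dist[k % p] - 1.0 / p)
    worst = 0.0
    for start in range(p):
        for length in range(1, p + 1, 2):
            worst = max(worst, abs(prefix[start + length] - prefix[start]))
    return worst

def walk_distances(p, steps):
    """List of (n, total variation, discrepancy) for n = 0, ..., steps."""
    dist = [0.0] * p
    dist[0] = 1.0                      # assumed starting point X_0 = 0
    rows = []
    for n in range(steps + 1):
        rows.append((n, total_variation_to_uniform(dist),
                     discrepancy_to_uniform(dist)))
        dist = step(dist, p)
    return rows

if __name__ == "__main__":
    for n, tv, disc in walk_distances(2 ** 7 - 1, 30):
        print(n, round(tv, 6), round(disc, 6))
```

Since closed balls are a subfamily of all sets, the discrepancy column never exceeds the total variation column; comparing how quickly each falls for several values of t is the point of the exercise.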

Again, analysis under these two metrics sheds some light on what the 
walk is doing. For $p = 2^t - 1$, the "doubling" nature of the process keeps 
the walk supported very near powers of 2. The discrepancy metric, which 
accounts for the topology of the space, falls very quickly because the walk 
has spread itself out even though it is supported on a small set of values. 
However, the total variation does not "see" this spreading; it only falls when 
the support of the walk is large enough. 

Thus, the difference in rates of convergence sheds some light on the nature 
of the random walk as well as the metrics themselves. 